 benchmark test


China beats U.S. with world's fastest supercomputer, but race not geared for AI work

The Japan Times

Workers at Elon Musk's xAI facility, which houses a large supercomputer known as Colossus, used for Artificial Intelligence (AI) data processing, in Memphis, Tennessee, on Sept. 11, 2025 | REUTERS

SAN FRANCISCO - China has overtaken the U.S. to win the top spot on a list of the world's fastest supercomputers, but the results may say more about Beijing's desire to show self-sufficiency in computing systems than its standing in the global AI race, experts said. The LineShine system at the National Supercomputing Center in Shenzhen, China, uses domestically designed chips and won the top spot on the TOP500, a biannual global ranking of supercomputers, with the country's first listing in three years. The ranking comes as the U.S. and China are increasingly competing in advanced computing, with U.S. President Donald Trump on Monday signing an executive order that aims to put the U.S. ahead of China in the emerging field of quantum computing. In the June 2026 edition of TOP500, LineShine beat out the previous titleholder, El Capitan, a supercomputer housed at Lawrence Livermore National Laboratory that the U.S. government uses to develop and maintain its nuclear weapons stockpile. But technology and policy experts said the results do not mean that China has the world's fastest computer for AI work because of changes in the computing industry in recent years and the methods used to compile the list.


Realistic Handwritten Multi-Digit Writer (MDW) Number Recognition Challenges

arXiv.org Artificial Intelligence

Isolated digit classification has served as a motivating problem for decades of machine learning research. In real settings, numbers often occur as multiple digits, all written by the same person. Examples include ZIP Codes, handwritten check amounts, and appointment times. In this work, we leverage knowledge about the writers of NIST digit images to create more realistic benchmark multi-digit writer (MDW) data sets. As expected, we find that classifiers may perform well on isolated digits yet do poorly on multi-digit number recognition. If we want to solve real number recognition problems, additional advances are needed. The MDW benchmarks come with task-specific performance metrics that go beyond typical error calculations to more closely align with real-world impact. They also create opportunities to develop methods that can leverage task-specific knowledge to improve performance well beyond that of individual digit classification methods.
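The gap between isolated-digit accuracy and whole-number accuracy can be illustrated with a small back-of-the-envelope calculation. This is only a sketch under a strong assumption that is not taken from the abstract: digit errors are treated as independent, which same-writer data may well violate. The function name `number_accuracy` is hypothetical.

```python
def number_accuracy(digit_accuracy: float, num_digits: int) -> float:
    """Estimate the chance that every digit of a number is read correctly.

    Assumption (not from the source): each digit is misread independently
    with the same probability, so per-digit accuracies simply multiply.
    """
    if not 0.0 <= digit_accuracy <= 1.0:
        raise ValueError("digit_accuracy must lie in [0, 1]")
    if num_digits < 1:
        raise ValueError("num_digits must be at least 1")
    return digit_accuracy ** num_digits


if __name__ == "__main__":
    # Compare a few number lengths for a classifier that is 99% accurate per digit.
    for n in (1, 5, 9):
        print(n, round(number_accuracy(0.99, n), 4))
```

Under this assumption, even a small per-digit error rate compounds with the length of the number, which is one way to read the abstract's observation that strong isolated-digit classifiers can still do poorly on multi-digit recognition.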


KunlunBaize: LLM with Multi-Scale Convolution and Multi-Token Prediction Under TransformerX Framework

arXiv.org Artificial Intelligence

Large language models have demonstrated remarkable performance across various tasks, yet they face challenges such as low computational efficiency, gradient vanishing, and difficulties in capturing complex feature interactions. To address these limitations, a novel framework has been proposed. This framework incorporates a learnable dense residual skip connection mechanism, a TransformerX module (a transformer-based component integrating multiscale convolution and adaptive activation functions), and a multitoken prediction interaction module. The learnable dense residual connections enhance information flow and feature capture across layers. Within the TransformerX module, large convolutional kernels aggregate semantic information from extensive text segments, while smaller convolutions focus on local word order and syntactic structures. The adaptive activation function dynamically adjusts its parameters based on the semantic features of the input text, improving the model's ability to handle diverse semantic expressions and complex relationships. The multitoken prediction module boosts data utilization and accelerates inference by predicting multiple future tokens. These components significantly enhance the performance and efficiency of large language models.


Chatbots Are Cheating on Their Benchmark Tests

The Atlantic - Technology

Generative-AI companies have been selling a narrative of unprecedented, endless progress. Just last week, OpenAI introduced GPT-4.5 as its "largest and best model for chat yet." Earlier in February, Google called its latest version of Gemini "the world's best AI model." And in January, the Chinese company DeepSeek touted its R1 model as being just as powerful as OpenAI's o1 model--which Sam Altman had called "the smartest model in the world" the previous month. Yet there is growing evidence that progress is slowing down and that the LLM-powered chatbot may already be near its peak.


Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation

arXiv.org Artificial Intelligence

Quantitative Artificial Intelligence (AI) benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. Currently, they shape the direction of AI development and are playing an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too do concerns about how and with what effects they evaluate highly sensitive topics such as capabilities, including high-impact capabilities, safety and systemic risks. This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices, published in the last 10 years. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing logic that fails to account for how AI models are increasingly multimodal and interact with humans and other technical systems). Our review also highlights a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results. Furthermore, it underscores how benchmark practices are fundamentally shaped by cultural, commercial and competitive dynamics that often prioritise state-of-the-art performance at the expense of broader societal concerns. By providing an overview of risks associated with existing benchmarking procedures, we problematise disproportionate trust placed in benchmarks and contribute to ongoing efforts to improve the accountability and relevance of quantitative AI benchmarks within the complexities of real-world scenarios.


Training on the Benchmark Is Not All You Need

arXiv.org Artificial Intelligence

The success of Large Language Models (LLMs) relies heavily on the huge amount of data learned in the pre-training phase. The opacity of the pre-training process and the training data makes the results of many benchmark tests unreliable. If any model has been trained on a benchmark test set, it can seriously harm the health of the field. To test the capabilities of large language models automatically and efficiently, numerous mainstream benchmarks adopt a multiple-choice format. As swapping the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model's log probability distribution over the derived data sets. If the maximum of the set of log probabilities is also an outlier, this indicates that the data has been leaked. Our method is able to work under black-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark, and we find that the Qwen family of LLMs has the highest degree of data leakage.
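The shuffle-and-score idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `log_prob` callable stands in for whatever model scoring interface is available, the prompt layout in `format_item` is made up, and the z-score rule with `z_threshold` is an assumed outlier criterion that the abstract does not specify.

```python
import itertools
import statistics
from typing import Callable, Sequence


def format_item(question: str, options: Sequence[str]) -> str:
    """Render a multiple-choice item as plain text (hypothetical layout)."""
    labels = "ABCDEFGH"
    lines = [question] + [f"{labels[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def looks_leaked(
    question: str,
    options: Sequence[str],
    log_prob: Callable[[str], float],
    z_threshold: float = 2.0,
) -> bool:
    """Flag an item when one option ordering scores far above all the others.

    Assumptions (not from the source): `log_prob` returns the model's log
    probability of a full text, and "outlier" means a z-score above
    `z_threshold` relative to the remaining orderings.
    """
    # Derived data set: every ordering of the option contents.
    texts = [format_item(question, perm) for perm in itertools.permutations(options)]
    scores = [log_prob(text) for text in texts]

    best = max(scores)
    rest = list(scores)
    rest.remove(best)  # compare the maximum against the other orderings

    mean = statistics.mean(rest)
    spread = statistics.pstdev(rest)
    if spread == 0:
        return best > mean
    return (best - mean) / spread > z_threshold
```

Because every ordering is scored, the check does not depend on knowing which ordering appeared in the training data, which matches the abstract's point about options that were shuffled before leaking. Enumerating all permutations grows factorially with the number of options, so this sketch is only practical for small option counts.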


In latest benchmark test of AI, it's mostly Nvidia competing against Nvidia

#artificialintelligence

For lack of rich competition, some of Nvidia's most significant results in the latest MLPerf were against itself, comparing its newest GPU, H100 "Hopper," to its existing product, the A100. Although chip giant Nvidia tends to cast a long shadow over the world of artificial intelligence, its ability to simply drive competition out of the market may be increasing, if the latest benchmark test results are any indication.


Nvidia's Impressive H100 MLPerf Benchmark

#artificialintelligence

In the complex world of AI/ML processing, it can be hard to compare products from various vendors due to the wide range of models and workloads in use. MLPerf is a consortium of major industry players and research organizations that provides agreed-upon benchmark tests to try to standardize test results across various vendor offerings, giving users a chance to evaluate competing performance claims. Nvidia has previously provided MLPerf test results for its A100 product. It has just released its MLPerf benchmarks for its new high-end device, the H100. It sports an impressive 6.7X performance gain over the older A100 devices in certain workloads, and is still being optimized with software that could eventually push the performance even higher.


Statistical Tests for Comparing Classification Algorithms

#artificialintelligence

Comparing prediction methods to decide which one should be used for the task at hand is a daily activity for most data scientists. Usually, one will have a pool of classification models and will validate them using cross-validation to determine which one is best. Another goal, however, is not to compare classifiers, but the learning algorithms themselves. The idea is: given this task (data), which learning algorithm (KNN, SVM, Random Forests, etc.) will generate more accurate classifiers on a dataset of size D? As we will see, every method presented here has some pros and cons. However, the first intuition of using a two-proportions test can lead to some really bad results.
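For reference, the two-proportions test mentioned above can be sketched in a few lines. This is a generic textbook version of the pooled two-sided z-test, not code from the article, and the function name `two_proportions_z_test` is hypothetical. One commonly cited caveat, offered here as an assumption rather than something the excerpt states, is that the test treats the two accuracy estimates as independent samples, which does not hold when both classifiers are scored on the same test set.

```python
import math
from typing import Tuple


def two_proportions_z_test(
    correct_a: int, n_a: int, correct_b: int, n_b: int
) -> Tuple[float, float]:
    """Pooled two-sided z-test for the difference between two accuracies.

    Returns (z statistic, two-sided p-value). Treats the two samples as
    independent, which is the assumption that makes this test risky for
    classifiers evaluated on the same examples.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if std_err == 0.0:
        # Both classifiers are all-correct or all-wrong: no observable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal tail: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Illustrative counts only: 90/100 correct versus 80/100 correct.
    print(two_proportions_z_test(90, 100, 80, 100))
```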


Facebook: Here comes the AI of the Metaverse

#artificialintelligence

To operate in augmented and virtual reality, Facebook believes artificial intelligence will need to develop an "egocentric perspective." To that end, the company on Thursday announced Ego4D, a data set of 2,792 hours of first-person video, and a set of benchmark tests for neural nets, designed to encourage the development of AI that is savvier about what it's like to move through virtual worlds from a first-person perspective. The project is a collaboration between Facebook Reality Labs and scholars from 13 research institutions, including academic institutions and research labs. The details are laid out in a paper lead-authored by Facebook's Kristen Grauman, "Ego4D: Around the World in 2.8K Hours of Egocentric Video." Grauman is a scientist with the company's Facebook AI Research unit.